Working With Data

October 27, 2023

What are Tidy Data?

Each column represents a single variable
Each row represents a single observation
Each cell represents a single value

Tidy Data Example
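
As a minimal sketch of the three rules above, here is a small tidy data frame. The country names and values are hypothetical, invented only for illustration; each column is one variable, each row is one country-year observation, and each cell holds one value.

```r
library(tibble)

# Hypothetical values, invented purely for illustration
tidy_example <- tribble(
  ~country,    ~year, ~flfp,
  "Country A",  2020,  50.1,
  "Country A",  2021,  50.4,
  "Country B",  2020,  61.7,
  "Country B",  2021,  62.0
)

# Print the data frame
tidy_example
```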

What are Clean Data?

Column names are easy to work with and are not duplicated
Missing values have been dealt with
There are no repeated observations or columns
There are no blank observations or columns
The data are in the proper format, for example dates should be formatted as dates
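
One way to check a data frame against this list, sketched on a made-up example. The `messy` data frame and its values are hypothetical, and these checks are only one possible approach, not the only way to clean data.

```r
library(dplyr)
library(tibble)

# Hypothetical messy data: one repeated row, one blank row, one missing value
messy <- tibble(
  country = c("Country A", "Country A", "Country B", NA),
  year    = c("2020", "2020", "2021", NA),
  flfp    = c(50.1, 50.1, NA, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(messy))

# Count repeated observations (rows identical to an earlier row)
n_repeated <- sum(duplicated(messy))

# Drop blank rows and repeated rows, then store year as a number
cleaner <- messy |>
  filter(!if_all(everything(), is.na)) |>
  distinct() |>
  mutate(year = as.numeric(year))

glimpse(cleaner)
```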

Messy Data Example

Step 1: Downloaded Some Messy Data

Created a folder called “data” in your course modules project folder
Downloaded some messy World Bank (WB) data

Step 2: Read the Data Into R

# Load packages
library(readr) 
library(dplyr) 

# Read data from csv file into an object called "wb_data_messy"
wb_data_messy <- read_csv("data/your_file_name.csv")

# View the data
glimpse(wb_data_messy)

Viewed with View(wb_data_messy) or by clicking on object…

Step 3: Wrote a New CSV File

write_csv(wb_data_messy, "data/your_new_file_name.csv")
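
To see Steps 2 and 3 work together without needing a downloaded file, here is a small round trip. It is a sketch only: the `example` data frame is hypothetical and the file goes to a temporary location rather than your "data" folder.

```r
library(readr)
library(tibble)

# Hypothetical data standing in for a downloaded file
example <- tibble(
  country = c("Country A", "Country B"),
  flfp    = c(50.1, 61.7)
)

# Write to a temporary csv file, then read it back in
csv_path <- tempfile(fileext = ".csv")
write_csv(example, csv_path)
example_again <- read_csv(csv_path, show_col_types = FALSE)

example_again
```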

With Excel

Can do the same thing with Excel files
Have a look at the readr cheatsheet
Follow same steps as with CSV file
- use read_excel() from the readxl package to read in the data
- install and experiment with writexl to write Excel files
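
A minimal sketch of the Excel version of the same steps, assuming the readxl and writexl packages are installed. The `example_data` data frame is hypothetical, and the temporary path stands in for a file in your own "data" folder.

```r
library(readxl)   # for read_excel()
library(writexl)  # for write_xlsx()

# Hypothetical data frame standing in for your own data
example_data <- data.frame(
  country = c("Country A", "Country B"),
  flfp    = c(50.1, 61.7)
)

# Write to a temporary .xlsx file (swap in a path inside your data folder)
xlsx_path <- tempfile(fileext = ".xlsx")
write_xlsx(example_data, xlsx_path)

# Read the Excel file back into R
example_from_excel <- read_excel(xlsx_path)

example_from_excel
```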

Messy Data Example

Pivot Longer

pivot_longer() takes three main arguments:

cols - which columns you want to pivot
names_to - the name of the column where the old column names are going to (identifier)
values_to - the name of the column where the values are going to
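
Here is how the three arguments fit together on a toy example. The `wide` data frame is hypothetical (two made-up countries and two year columns); it is a sketch of the idea, not the World Bank file itself.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per year
wide <- tribble(
  ~country,    ~`2020`, ~`2021`,
  "Country A",  50.1,    50.4,
  "Country B",  61.7,    62.0
)

# Pivot the year columns into two columns: year and flfp
long <- wide |>
  pivot_longer(
    cols = `2020`:`2021`, # columns to pivot
    names_to = "year",    # old column names go here
    values_to = "flfp"    # cell values go here
  )

long
```

Note that the new year column is stored as character, because it was built from column names.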

Can you remember how to make pivot_longer() work?

# Load tidyr
library(tidyr)

# Reshape the data
      <-    |>
  pivot_longer(
    cols = 
    names_to = 
    values_to = 
  )

glimpse()

# Load tidyr
library(tidyr)

# Reshape the data
wb_data <- wb_data_messy |> 
  pivot_longer(             
    cols = `1973 [YR1973]`: `2022 [YR2022]`, # columns to pivot
    names_to = "year", # name the identifier column "year"
    values_to = "flfp" # name the numeric var column "flfp"
  ) 

# View the data
glimpse(wb_data)

Mutating Variables

Any time we want to change a variable, we are going to use the dplyr verbs mutate() or mutate_at()
mutate() is for when you want to change one variable
mutate_at() is for changing multiple variables at once
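
A small sketch comparing the two verbs on a hypothetical data frame called `scores`. It also shows across(), which current dplyr documentation recommends in place of the superseded mutate_at(); both give the same result here.

```r
library(dplyr)
library(tibble)

# Hypothetical data where both columns were read in as character
scores <- tibble(
  year = c("2020", "2021"),
  flfp = c("50.1", "50.4")
)

# Change one variable with mutate()
one_var <- scores |>
  mutate(year = as.numeric(year))

# Change several variables with mutate_at()
many_at <- scores |>
  mutate_at(c("year", "flfp"), as.numeric)

# Same change written with mutate() and across()
many_across <- scores |>
  mutate(across(c(year, flfp), as.numeric))

glimpse(many_across)
```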

Fixing Our Variables

# Fix year and flfp
wb_data <- wb_data |> # replace wb_data with a modified version of the dataframe 
  mutate(year = substring(year, 1, 4)) |> # truncate year (keep first four characters)
  mutate_at(c("year", "flfp"), as.numeric) # change year and flfp to numeric

# View the data
glimpse(wb_data)

Clean Variable Names

Very simple: use the janitor package!

# Load janitor
library(janitor)

# Apply clean_names() to wb_data, store in new data frame called wb_data_clean
wb_data_clean <- wb_data |>  
  clean_names() 

# Write wb_data_clean to a csv file
write_csv(wb_data_clean, "data/wb_data_clean.csv")

# View the data
glimpse(wb_data_clean)
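
To see what clean_names() does to column names, here is a toy example. The `messy_names` data frame is hypothetical; the point is only to show the names before and after.

```r
library(janitor)
library(tibble)

# Hypothetical data frame with names that are awkward to type
messy_names <- tibble(
  `Country Name` = "Country A",
  `Country Code` = "AAA",
  `Year` = 2020
)

# Convert the names to lowercase with underscores
tidy_names <- messy_names |>
  clean_names()

names(messy_names)
names(tidy_names)
```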

APIs

API stands for “Application Programming Interface”
Way for two computers to talk to each other
In our case, we will use APIs to download social science data
APIs are accessed through packages in R
Sometimes there can be more than one package for an API
Usually easier than reading in data from a messy flat file

Datasets

World Bank data through wbstats
- There is another package called WDI
Varieties of Democracy (V-Dem) through vdemdata
- This is the only package for V-Dem
- No arguments
- Just downloads all the data!

Searching for WB Indicators

library(wbstats) # for searching and downloading WB data

flfp_indicators <- wb_search("female labor force") # store the list of indicators

print(flfp_indicators, n=26) # view the indicators

Try it Yourself

Start a new Quarto document
Start a new code chunk
Load the wbstats package
Use wb_search() to find some indicators you are interested in
If search is not successful, visit WDI site


`wbstats` Example

# Load packages
library(wbstats) # for downloading WB data
library(dplyr) # for selecting, renaming and mutating
library(janitor) # for rounding

# Store the list of indicators in an object
indicators <- c("flfp" = "SL.TLF.CACT.FE.ZS", "women_rep" = "SG.GEN.PARL.ZS") 

# Download the data  
women_emp <- wb_data(indicators, mrv = 50) |> # download data for last 50 yrs
  select(!iso2c) |> # drop the iso2c code which we won't be using
  rename(year = date) |> # rename date to year 
  mutate(
    flfp = round_to_fraction(flfp, denominator = 100), # round to nearest 100th
    women_rep = round_to_fraction(women_rep, denominator = 100) 
  )

# View the data
glimpse(women_emp)

Try it Yourself

Using the variables you identified in your search, download the data
Think about what elements of the code you need to change
Hint: the indicators are going to change!


Downloading V-Dem Data

The vdem object from vdemdata gives us the entire V-Dem dataset
There are no parameters for selecting indicators or date ranges
So we have to do more work with dplyr functions
- filter() for years
- select() for variables
We are also going to use case_match() to add region names
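
The three steps can be sketched on a tiny stand-in for the V-Dem data. The `fake_vdem` data frame and its values are hypothetical and use only a few V-Dem column names; case_match() assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the V-Dem data (invented values, few columns)
fake_vdem <- tibble(
  country_name   = c("Country A", "Country A", "Country B"),
  year           = c(1985, 1995, 2000),
  v2x_polyarchy  = c(0.20, 0.45, 0.80),
  e_regionpol_6C = c(2, 2, 5)
)

small <- fake_vdem |>
  filter(year >= 1990) |> # keep 1990 onward
  select(                 # select (and rename) variables
    country = country_name,
    year,
    polyarchy = v2x_polyarchy,
    region = e_regionpol_6C
  ) |>
  mutate(                 # swap region codes for region names
    region = case_match(region,
                        2 ~ "Latin America",
                        5 ~ "The West")
  )

glimpse(small)
```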

`vdemdata` Example

# Load packages
library(vdemdata) # to access the V-Dem data
library(dplyr) # for filtering, selecting and mutating

# Download the data
democracy <- vdem |> # start from the full V-Dem dataset
  filter(year >= 1990)  |> # keep only 1990 onward
  select(                  # select (and rename) these variables
    country = country_name,     # the name before the = sign is the new name  
    vdem_ctry_id = country_id,  # the name after the = sign is the old name
    year, 
    polyarchy = v2x_polyarchy, 
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region, # replace the codes in region with region names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
                    # number on the left of the ~ is the V-Dem region code
                    # we are changing the number to the region name on the right
                    # of the ~
  )

# View the data
glimpse(democracy)

Try it Yourself

Go to the V-Dem Codebook
Select a democracy indicator from Part II and one other indicator
Note the indicator code (e.g. “v2x_polyarchy” for the polyarchy score)
In your Quarto notes document, create a new code chunk
Use the codes for your selected indicators to download the data

